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We study a variant of the Schwarz-preconditioned HMC algorithm. In contrast to the original 
proposal of Liischer, we apply the domain decomposition in one lattice direction only. This is 
sufficient to reduce the condition number of the fermion matrix restricted to the domains com- 
pared with the full fermion matrix. For the same linear extension of the domain, less links reside 
on the boundaries of the domains. Therefore it becomes e.g. practical to iterate the decomposi- 
tion. We perform numerical tests for two degenerate flavours of Wilson fermions. The standard 
Wilson gauge action at j3 = 5.6 is used. The performance of our implementation is compared 
with other recent studies using various types of preconditioning. 
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1. Introduction 

We consider a system with two degenerate flavours of quarks that is defined by the partition 
function 

Z = J D[U]exp(-S G [U]) detM[U] 2 , (1.1) 

where Sg[U] = — §LjtL,u>vR e Tr (u x> /j,U x+ fi tV U^ + ^ ^U x>v \ is the standard Wilson plaquette ac- 
tion, x = (xo,X\,X2,X3) with Xi integer in the range < x,- < Lj are sites on a hyper-cubical lattice, 
jl, V € {0, 1,2,3} are directions on the lattice and jX is a unit vector in pL -direction. The gaugefield 
U x ,\i is an element of the group SU(3). In eq. Ql.l[ ), the fermion degrees of freedom have been 
integrated out, leading to the fermion determinant in the weight. The Wilson fermion matrix is 
given by 

M{U) xy = l-KY,{(l-y^U^x)d x+(liy +(l + y fl )ul(x-fi)d x _ (l>y } , (1.2) 
n 

where the are the euclidian /-matrices, and fc is the so called hopping parameter, which is related 
with the bare mass of the fermions. 

Recently there had been algorithmic progress [jl|, |2], ||, ^ ||] in the simulation of lattice QCD 
at light quark masses. In two flavour simulations, following the determinant of the fermion 
matrix M is represented as detMM^ <=c fD<p f /D0 exp(— |M _1 0| 2 ), where is the pseudo-fermion 
field and S p f = \M~ l (p\ 2 the pseudo-fermion action. The basic idea of [|], ||, [| |], ||] is to chose al- 
ternative representations of the fermion determinant while keeping the Hybrid Monte Carlo (HMC) 
algorithm unchanged otherwise. To this end, the fermion matrix is factorized M = fj; Wi such that 
the factors Wi have a smaller condition number than the fermion matrix M itself. A pseudo-fermion 
field is introduced for each of the factors 

detMM 1 " = JJdetW/Wf ~ J D<j>l J D0j J D0 2 f J D0 2 -J D0,]" J D0„ exp(-^ |^Vi| 2 ) • 

(1.3) 

The effect of this splitting is two-fold: The noise of the stochastic representation of the fermion 
matrix is reduced compared with the standard pseudo-fermion action and furthermore, the splitting 
of the action allows to compute numerically expensive parts less frequently, as suggested in [^]. 

Here we discuss a variant of the Schwarz-preconditioned HMC put forward by Liischer [Q]. 
While in the other cases [jl| |2|, |5|] the factors Wi can be written as a function of the fermion matrix, 
here a spatial decomposition is the basis for the factorization. 

The lattice is decomposed into blocks of the size /o x l\ x l 2 x 1$, with < L^. An approxi- 
mation W\ of M is obtained by eliminating the hopping terms in M that connect different blocks. 
Liischer [Q] made the important observation that det 2 (W 1 _1 M) can be estimated by using a pseudo- 
fermion field that resides on the boundaries of black blocks only (lets assume a red/black decom- 
position of the blocks.). Furthermore in eqs. (3.12,3.13) of [Q] he shows how the force due to the 
pseudo-fermion action for det 2 (W 1 _1 M) can be computed efficiently. In the following we shall use 
these results without any modification; also the result of Appendix B of [Q] is used in the following 
to reduce the dimension of the pseudo-fermion field by half. 

Here we consider a block-decomposition in one dimension only, say the temporal direction. 
I.e. = Lu for jj. = 1,2,3. The reasons to study this special case are the following: a) the 
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implementation becomes much simpler; mainly because there are no sites in a corner of block, 
b) At least for the lattice spacings currently investigated, the fraction of links on the boundary 
between blocks is much less; therefore the number of active links, i.e. those links that take part 
in the molecular-dynamics evolution is larger, c) The simplification enables us to iterate the block 
decomposition. 

Disadvantages of the one-dimensional decomposition are that it is less useful for a massive 
parallelization of the program and what might be more important, for the same lo the condition 
number of W\ might be larger than for a decomposition in all four directions. However the ex- 
perience with Schrodinger functional boundary conditions suggests that still there is a substantial 
reduction of the condition number of W\ compared to M. 

In our numerical experiments, we have iterated the decomposition twice. In the simulations 
discussed below, we have chosen ljp = Lo/2 for the first step and = 1^/2 = Lo/4 for the 
second step of the decomposition. Wi denotes the fermion matrix restricted to the blocks of size Iq . 
For W2 we have used even-odd and mass-preconditioning [jl|]: W^^o = W2, eo + P- the pseudo- 
fermion action consists of four parts: S4, S3, S2, Si representing the squares of the determinant of 
MW{~ , WiW^ 1 , W2 t e Wfg and W3 , eo , respectively. Note the counter-intuitive connection between 
the labels of the S and the W. So is given by the gauge action. 

2. Integration with multiple time scales 

The basic steps of the integration scheme are given by 

T v (At) : U^e iAtP U and T PjJ (At) : P -> P - /At SuSjQJ) , (2.1) 

where 5rj denotes a variation with respect to the gauge fields. From these basic steps we can build 
elementary leap-frog steps 

r LF , () (ATo) = Tpjo Tu{ATo)T Pfi (^) (2.2) 

or steps of an improved scheme (here we follow 

7W.o(At ) = 7> )0 (AAt„) Tv {^f \ 7>.o ([1 - 2A]At ) T v (^j T Pfi (AAt ) (2.3) 

with X = 1/6. Note that in an elementary step of this scheme, the variation of the action with respect 
to the gauge-fields has to be computed twice. This scheme is closely related with the second order 
minimum norm scheme (2MN) studied in The only difference is the choice A « 1/5 in So 
is the part of the action with the largest forces. Elementary integration steps that include parts Sj 
of the action that have smaller forces are now constructed recursively as 

T LFJ (ATj) = T PJ (^fj [T XJ ^(ATj^)T^ T PJ (^fj (2.4) 

in the leapfrog case and 

T SW j(ATj) = T P j(XATj) [Txj-i (At / _i)]">- 1 / 2 T pj ([1 -2k]Azj) [T X j-i(ATj^)] n ^ 2 T P j (AAt/) 

(2.5) 
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in the improved case. The step sizes of the different levels are related as Atj = rij-iAtj-i. In both 
cases X can be either leap frog (LF) or the improved scheme (SW). This means that for different 
time scales, different integration schemes can be used. Here we have used the leapfrog scheme for 
the levels j = 2, 3,4 and the improved one for j = 0,1. A full trajectory is given by Tzj^KAt^" 4 . 

In the case of the Schwarz-preconditioning, the force due to the pseudo-fermion action de- 
pends quite strongly on the position of the gauge link with respect to the boundaries of the blocks. 
I.e. here on xq. Therefore, as discussed in one might chose a step size that depends on the 
position, such that the step size times the force is roughly constant. As we shall see below, the 
force is the largest close to the boundaries of the blocks. Therefore, we have used the following 
schemes: 

(A) In the case of Lq = 24 we have used s(xq) = 0.2, 0.5, 1, 1, 0.5 and 2 for xq = 0, 1, 5 for the 
space-like links and s(xo) = 0.2, 0.5, 1, 0.5, 0.2 and for xq = 0, 1, 5 for time-like links. This 
scheme is then repeated: ^(xo + 6n) = s(xo), where n £ 1,2,3. 

(B) for Lo = 32 is given by s(xq) = 0, 0.5, 1, 1, 1, 1, 0.5 and 0, for xo = 0, 1, 7 for the space-like 
links and s(0) = 0, 0.5, 1, 1, 1, 0.5, and for xo = 0, 1, 7, for time-like links. This scheme is 
then repeated: s(xq + %n) = s(xo), where n E 1,2,3. 

(C) for L = 32 is given by s(0) = 0, 0.25, 0.5, 1,1,1,1, 0.25, 0.25, 1,1,1,1, 0.5, 0.25, 0, for 
x = 0, 1, 15 for the spatial links and s(0) =0, 0,0.25, 0.5, 1, 1 ,0.25,0,0.25, 1, 1, 1,0.5,0.25, 
0, for xq = 0, 1, 15 for the time-like links. For xq > 15: s(xo) = s(xo — 16). 

Note that the blocks of the first decomposition run from xq = up to Lq/2 — 1 and from 
xo = Lq/2 up to Lo — 1. For the scheme (A) the average of s over all links is 0.525. For the schemes 
(B) and (C) it is about 0.59. The actual step size for a given link is At quoted below times s(xq). 
In order to ensure ergodicity of the update, the configuration is shifted in time direction after each 
trajectory. 

3. Numerical results 

We have simulated the Wilson gauge action at /3 =5.6 with Wilson fermions using the values 
of the hopping parameter: K = 0.1575, 0.1580 and 0.15825. These parameters are chosen such 
that we can compare our results with [Q, ^|, [| [KJ. Following the literature, these bare parameters 
correspond roughly to a pseudo-scalar mass of 690 MeV, 490 MeV and 370 MeV. Note that in the 
real world the pion mass is m n ph 135 MeV. The lattice spacing is about 0.8 fm. 

As solver we have used the geometric series for S\ , S2 and S3 and the BiCGstab solver with 
even-odd and Schwarz-preconditioning for S4. The basic parameters of our runs are summarized 
in table [j]. The parameters of the algorithm have been chosen such that roughly the number of 
steps of the solver is the same for each part of the pseudo-fermion action. The typical length of our 
runs is 2000 trajectories after equilibration up to about 5000 trajectories for the runs with L = 12. 
On 8 CPUs (Opteron 2.2 GHz) of a Cray XD1 computer one trajectory for the 32 x 24 3 lattice at 
K = 0.15825 took about 2.5 hours. Note that in our program the Dirac operator runs with less than 
one Gflops per processor and the sub-optimal choice of solver. Our CPU time can be compared 
with about 0.3 hours [Q] (from fig. 7) on 8 nodes with two 2.4 GHz Xeon CPUs each. Note that in 
this case the trajectory length is only z = 0.5 and also the number of active links is about half of 
ours. 
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Table 1: Basic parameters of ourruns. P acc is the acceptance rate at the end of the trajectory. S denotes the 
scheme used for the xq dependence of the step size, p is the parameter of the mass preconditioning. 
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avarage force on spacial links 




Figure 1: We give the average force on spatial links as a function of xq. For a discussion see the text 



In fig. [I] we give the average forces on the spatial links as a function of xq. The largest force is 
obtained for the gauge action. The forces due to S3 and S4 display a strong dependence on xo. They 
are largest at the boundaries between the blocks. In the case of S4, they assume their minimum in 
the middle of the block. In the case of S3 the minimum is located at the boundaries of the blocks 
of the first decomposition. Note that the minimum of the force due to S3 is much smaller than that 
of the force due to S4. 

The step sizes needed to obtain a sufficient acceptance rate can be compared with results from 
the literature. Here we give only a small selection: Using standard HMC, the authors of [10] need 
the step size At = 0.006 for k = 0.1580 on a 32 x 16 3 lattice to get P acc = 0.66. Note that in this 
case the pseudo-fermion action is computed with the fermion matrix itself and not with the even- 
odd preconditioned one. Our most difficult case, the 32 x 24 3 lattice at fc = 0.15825 we compare 
with [Q] who needs At = 0.05 to reach P acc = 0.86 and [Bj, using mass preconditioning, where 
At = 0.1 is needed to get P acc = 0.8. In [Q] At = 0.25 is used in combination with a fourth order 
minimal norm integrator. 
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16 A 3 x 32 beta=5.6 kappa=0.1575 



16 A 3x32 beta=5.6 kappa=0.1575 




24 A 3x32 beta=5.6 kappa=0. 15825 



24 A 3 x 32 beta=5.6 kappa=0. 15825 




Figure 2: History of the plaquette average and the number of steps of the solver. The red line gives the 



average of the plaquette obtained in | |10[ in the case of L = 16 and K = 0.1575 and [[4|, y_2|] in the case of 
L = 24 and K = 0.15825. 



In order to judge the performance of an algorithm autocorrelation times for the quantities of 
interest have to be determined. This is however a notoriously hard problem in HMC simulations of 
QCD with dynamical fermions. 

In fig. H we give the evolution of the plaquette value and the number of steps taken by the 
solver for the simulation of a 32 x 16 3 lattice at K = 0.1575. The run started from a configuration 
equilibrated by a different version of HMC algorithm. The plots give no indication for autocorrela- 
tion times that are comparable with the length of the run itself. We get Zp = 8(2) and z so / v = 16(5) 
as integrated autocorrelation times of the plaquette and the number of steps of the solver. The time 
unit is given by a trajectory. These numbers can be compared with Zp = 7(4) and z so i v = 33(4) for a 
standard HMC simulation [10] and Zp = 68(25) and z so i v = 168(42) for a Schwarz preconditioned 
HMC simulation [|J]. Note that in [Q] the trajectory length is z = 0.5 and only about 37% of the 
links are active. This might trivially explain a factor of about 4 compared with our simulation. One 
also should note that the authors of [11] find that even larger trajectory lengths such as z = 2 are 
advisable to obtain optimal performance. 

In the case of L = 24 and K = 0.15825 we do not quote values for autocorrelation times. 
The time histories of the average plaquette and the number of solver steps suggests that there are 
correlations that are comparable with the length of our run or even larger. Note that the authors of 
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[Q, |5], ^, 12] do not see such problems and quote rather small values of the autocorrelation times. 
Taking into account the length of the trajectory and the fraction of active links, our run is of similar 
length as that of [|2|]. One should take into account the possibility that [Q, ||, [), 12] do not see these 
large autocorrelations since their runs are too short. 

4. Conclusions and outlook 

Using preconditioned pseudo-fermion actions [111 |2[ [| ||, |5]] the problem that the step size 
needed to obtain a reasonable acceptance rate decreases with decreasing fermion mass seems to 
be overcome. The performance of the different proposals seems to be quite similar. Still the 
dependence of autocorrelation times related to small eigenvalues of the fermion matrix on the 
choice of the pseudo-fermion action is not well understood. To this end, it might be useful to 
monitor e.g. the topological charge. Likely also much longer runs then those presented here and 
in ® ^, 12] are needed to this end. A disadvantage of the Schwarz-preconditioning is that it is 



quite hard to implement fermion actions that are more complicated than clover-improved Wilson 
fermions. Since in the case of Schwarz-preconditioning the pseudo-fermions reside on boundaries 
only, it is possible that the performance of the HMC scales differently (hopefully better) with the 
lattice spacing than for the other types of preconditioning. 

References 

[1] M. Hasenbusch, Phys.Lett. B 519 (2001) 177 [arXiv:hep-lat/0107019]. 

[2] M. Peardon, Nucl.Phys.B (Proc.Suppl.) 106&107 (2002) 3 [arXiv:hep-lat/0201003]. 

[3] M. Hasenbusch and K. Jansen, Nucl.Phys.B 659 (2003) 299 [arXiv:hep-lat/021 1042]. 

[4] M. Luscher, JHEP 0305 (2003) 052 [arXiv:hep-lat/0304007]; Comput.Phys.Commun. 165 (2005) 199 
[arXiv:hep-lat/0409106]. 

[5] M.A. Clark and A.D. Kennedy, Phys.Rev.Lett. 98 (2007) 051601 [arXiv:hep-lat/0608015]. 

[6] D. Weingarten and D. Petcher, Phys. Lett. B 99 (1981) 333. 

[7] J.C. Sexton and D.H. Weingarten, Nucl.Phys.B 380 (1992) 665. 

[8] T. Takaishi and P. de Forcrand, Phys.Rev.E 73 (2006) 036706 [arXiv:hep-lat/0505020]. 

[9] C. Urbach, K. Jansen, A. Shindler and U. Wenger, Comput.Phys.Commun. 174 (2006) 87 
[arXiv:hep-lat/0506011]. 

[10] B. Orth, T. Lippert and K. Schilling, Phys.Rev.D 72 (2005) 014503 [arXiv:hep-lat/0503016] and refs 
therein. 

[11] H.B. Meyer, H. Simma, R. Sommer, M. Delia Morte, O. Witzel and U. Wolff, 
Comput.Phys.Commun. 176 (2007) 91 [arXiv:hep-lat/0606004]. 

[12] L. Del Debbio, L. Giusti, M. Luscher, R. Petronzio and N. Tantalo, 

JHEP 0602 (2006) 01 1 [arXiv:hep-lat/05 12021]; JHEP 0702 (2007) 056 [arXiv:hep-lat/06 10059]; 
M. Luscher, private communication. 



7 



